Abstract. Collective classification models attempt to improve classification
performance by taking into account the class labels of related
instances. However, they tend not to learn patterns of interactions be- 
tween classes and/or make the assumption that instances of the same 
class link to each other (assortativity assumption). Blockmodels provide
a solution to these issues, being capable of modelling assortative and 
disassortative interactions, and learning the pattern of interactions in 
the form of a summary network. The Supervised Blockmodel provides 
good classification performance using link structure alone, whilst simul- 
taneously providing an interpretable summary of network interactions 
to allow a better understanding of the data. This work explores three 
variants of supervised blockmodels of varying complexity and tests them 
on four structurally different real world networks. 

Keywords: Collective Classification, Supervised Learning, Blockmod- 
elling, Node Classification, Statistical Network Analysis 

1 Introduction 

Probabilistic classification algorithms have long focused on the problem of pre- 
dicting unknown labels of data instances according to their attributes by lever- 
aging the conditional distributions of a supplied training set. These algorithms 
traditionally made an assumption that the data was independent and identically 
distributed; however, many modern datasets contain relations between instances that break this assumption. As a result,
research has shifted to examining how these relations or links can be exploited 
to improve classification performance. Collective classification is one approach 
which attempts this but often assumes that instances of a given class tend to 
link to others of the same class, i.e. that the class instances are assortative.

Recently, stochastic blockmodelling has been applied in a classification context
to overcome the need for assortativity assumptions [1,2]. In addition, the
stochastic blockmodel can be used to understand the pattern of interactions 
between class instances by the way of a summary network of role interactions. 
This work describes three classification models of varying complexity based on 
the stochastic blockmodel along with efficient inference updates using collapsed 
variational Bayes. Their relative performance is investigated in various within- 
network classification cases. Finally, an example is given of the analysis that can 
be conducted on the resulting model to better understand both the structure of
the data and the classification decision. 

The main contribution of this work is the comparison between the models 
and the introduction of a model of intermediate complexity (section 3.2). A 
minor contribution is the new update equations (section 4) based on collapsed
variational inference [3], which avoid the long running time and convergence
diagnosis of the Gibbs sampling in [2] and the parameter updates and expensive
Digamma function evaluations of the variational inference in [1].



2 Background 

Blockmodels have been used for social and psychometric analysis for decades
[4,5,6]. The name refers to the "blocks" of zero and non-zero elements that occur
in the adjacency matrix when the rows and columns are reordered such that
nodes with similar interaction patterns are adjacent. The clusters of nodes which 
make up these block patterns are known as network roles. Nodes belonging to 
the same role are equivalent to each other with respect to their probability of 
linking to nodes of other roles in the network. The pattern of interactions be- 
tween roles provides a summary of the interactions of the network. The original 
blockmodels were a-priori blockmodels where the assignment of nodes to roles 
was predetermined, usually according to the attributes of the nodes. Bayesian 
formulations of blockmodelling, known as stochastic blockmodelling [7], were
developed to create blockmodels by automatically inferring node roles according
to the posterior distribution given the observed network links. The blockmod- 
elling paradigm then comes full circle with the supervised blockmodel as the 
roles inferred from the network structure are then used to predict the attributes 
of the network in a given classification problem. 
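
The block structure described above can be illustrated with a minimal Python sketch. The function name `block_densities` and the toy network are illustrative assumptions and are not part of the original work. Given a role for every node, the sketch computes the fraction of possible directed links observed between each pair of roles, which is the kind of role-by-role summary that a blockmodel provides.

```python
def block_densities(adj, roles, num_roles):
    """Return a K x K matrix of observed link densities between roles.

    adj is an N x N 0/1 adjacency matrix (adj[s][r] = 1 for a link from
    sender s to receiver r) and roles[v] is the role of node v.
    """
    links = [[0] * num_roles for _ in range(num_roles)]
    pairs = [[0] * num_roles for _ in range(num_roles)]
    n = len(adj)
    for s in range(n):
        for r in range(n):
            pairs[roles[s]][roles[r]] += 1
            links[roles[s]][roles[r]] += adj[s][r]
    # Density = observed links / possible ordered pairs for each role pair.
    return [[links[a][b] / pairs[a][b] if pairs[a][b] else 0.0
             for b in range(num_roles)] for a in range(num_roles)]


# Toy disassortative network: nodes 0-1 (role 0) link only to nodes 2-3 (role 1).
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
summary = block_densities(adj, roles=[0, 0, 1, 1], num_roles=2)
print(summary)  # [[0.0, 1.0], [0.0, 0.0]]
```

In this toy case all links fall in the off-diagonal block, a pattern that an assortativity assumption would not capture.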

Stochastic blockmodels are usually used in an unsupervised context but can 
easily be transferred to the supervised setting by simply instantiating the roles 
of the nodes in the training set and inferring the remainder of the network as 
in [2]. In this case no extra variables are required as the roles and classes are 
equivalent and the inference procedure remains the same. This type of model, 
however, assumes that the classes are homogeneous in their linkage patterns and 
that all nodes of a particular class behave in the same way. To address this, 
an extension to the standard blockmodelling approach can be made based on 
supervised Latent Dirichlet Allocation (sLDA) [8] which, as the name suggests, is 
a supervised extension of the topic modelling approach LDA [9]. Latent Dirichlet
Allocation is a method for clustering a corpus of documents into topics. The
sLDA model extends the LDA approach to identify topics in documents which
not only best describe the document structure but also predict a known
response variable (i.e. a classification or regression target) associated with each 
document. Similarly, a supervised blockmodel can be derived which identifies 
roles which both summarise the network structure and predict the class labels 
of nodes. By making a distinction between the roles and the classes it is possible 
to model heterogeneous linkage patterns within classes. Two such models are 
presented here: one which assigns a single role to each node, and one which
allows nodes to have multiple role memberships. 

3 Supervised Blockmodels 

This section describes the three variants of supervised blockmodels examined in 
this paper. 

3.1 Standard Stochastic Blockmodel 

A standard Stochastic Blockmodel (SBM) assumes the following generative pro- 
cess: 

1. For a given network draw a distribution over the $K$ roles in the network
$\theta \sim \mathrm{Dirichlet}(\alpha)$

2. For each of the $K \times K$ possible role interactions:

(a) Draw a probability of interacting $\pi_{k_1,k_2} \sim \mathrm{Beta}(\beta_1, \beta_2)$

3. For each node in the network, $v \in \{1, \ldots, N\}$:

(a) Draw a role $z_v \sim \mathrm{Categorical}(\theta)$

4. For each of the $N \times N$ possible sender-receiver directed interactions, $s, r$:

(a) Draw a binary value to indicate the presence or absence of a link
$e_{s \to r} \sim \mathrm{Bernoulli}(\pi_{z_s,z_r})$

The application of a standard Stochastic Blockmodel was demonstrated in 
[2] where the roles and classes are considered as the same thing. 
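
The generative process of the SBM can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions rather than the code used in this work: the function name `sample_sbm`, the symmetric hyperparameter defaults and the seed are all hypothetical choices.

```python
import random


def sample_sbm(num_nodes, num_roles, alpha=1.0, beta1=1.0, beta2=1.0, seed=0):
    """Sample roles and a directed adjacency matrix from the SBM process."""
    rng = random.Random(seed)
    # Step 1: theta ~ Dirichlet(alpha), drawn as normalised Gamma variates.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_roles)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # Step 2: pi[k1][k2] ~ Beta(beta1, beta2) for every ordered role pair.
    pi = [[rng.betavariate(beta1, beta2) for _ in range(num_roles)]
          for _ in range(num_roles)]
    # Step 3: z_v ~ Categorical(theta) for every node.
    z = rng.choices(range(num_roles), weights=theta, k=num_nodes)
    # Step 4: e[s][r] ~ Bernoulli(pi[z_s][z_r]) for every ordered node pair.
    e = [[1 if rng.random() < pi[z[s]][z[r]] else 0
          for r in range(num_nodes)] for s in range(num_nodes)]
    return theta, pi, z, e


theta, pi, z, e = sample_sbm(num_nodes=6, num_roles=2)
```

Because the link probability depends only on the roles of the sender and receiver, the same code covers assortative and disassortative structure; the difference lies entirely in the sampled values of `pi`.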

3.2 Supervised Single Membership Blockmodel 

The Supervised Single Membership Blockmodel (SSMB) is very similar to the 
SBM but introduces a separate class variable to allow for heterogeneity within 
classes, i.e. each class may have more than one role. 

1. For a given network draw a distribution over the $K$ roles in the network
$\theta \sim \mathrm{Dirichlet}(\alpha)$

2. For each of the $K \times K$ possible role interactions:

(a) Draw a probability of interacting $\pi_{k_1,k_2} \sim \mathrm{Beta}(\beta_1, \beta_2)$

3. For each role in the network, $k \in \{1, \ldots, K\}$:

(a) Draw a distribution over classes $\mu_k \sim \mathrm{Dirichlet}(\eta)$

4. For each node in the network, $v \in \{1, \ldots, N\}$:

(a) Draw a role $z_v \sim \mathrm{Categorical}(\theta)$

(b) Draw a class label $y_v \sim \mathrm{Categorical}(\mu_{z_v})$

5. For each of the $N \times N$ possible sender-receiver directed interactions, $s, r$:

(a) Draw a binary value to indicate the presence or absence of a link
$e_{s \to r} \sim \mathrm{Bernoulli}(\pi_{z_s,z_r})$
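
The additional steps of the SSMB, drawing a class distribution for each role and then a class label for each node, can be sketched as follows. The function name `sample_class_labels`, the default value of the hyperparameter and the example roles are assumptions made for illustration only.

```python
import random


def sample_class_labels(z, num_roles, num_classes, eta=1.0, seed=0):
    """Draw mu_k ~ Dirichlet(eta) per role, then y_v ~ Categorical(mu_{z_v})."""
    rng = random.Random(seed)
    mu = []
    for _ in range(num_roles):
        # Dirichlet draw via normalised Gamma variates.
        gammas = [rng.gammavariate(eta, 1.0) for _ in range(num_classes)]
        total = sum(gammas)
        mu.append([g / total for g in gammas])
    # Each node's label depends only on its role.
    y = [rng.choices(range(num_classes), weights=mu[k])[0] for k in z]
    return mu, y


# Hypothetical role assignments for six nodes and three roles.
roles = [0, 0, 1, 2, 2, 1]
mu, y = sample_class_labels(roles, num_roles=3, num_classes=2)
```

With more roles than classes, several roles can place most of their mass on the same class, which is how the model represents heterogeneity within a class.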

3.3 Supervised Mixed Membership Blockmodel

The Supervised Mixed Membership Blockmodel (previously presented in [1])
extends the unsupervised mixed membership blockmodels from the literature
[10,11,12]. The Supervised Mixed Membership Blockmodel (SMMB) assumes
the following generative process:

1. For a given network draw a distribution over the possible $K^2$ network role
interactions $\pi \sim \mathrm{Dirichlet}(\alpha)$

2. For each role $k \in \{1, 2, \ldots, K\}$:

(a) Draw a distribution over nodes $\phi_k \sim \mathrm{Dirichlet}(\beta)$

3. For each interaction $i \in \{1, 2, \ldots, I\}$:

(a) Draw a role interaction pair $z_i = (z_s, z_r)_i$, $z_i \sim \mathrm{Categorical}(\pi)$

(b) Draw a sender node $s_i \sim \mathrm{Categorical}(\phi_{z_s})$

(c) Draw a receiver node $r_i \sim \mathrm{Categorical}(\phi_{z_r})$

4. For each node $v \in \{1, 2, \ldots, N\}$:

(a) Draw a class label $y_v \sim \mathrm{Softmax}(\eta, \bar{z}_v)$

where $\bar{z}_v = \frac{1}{n_v} \sum_i (z_{s_i} \delta_{s_i,v} + z_{r_i} \delta_{r_i,v})$, $z_{s_i}$ and $z_{r_i}$ are the indicator vectors of
length $K$ describing the network role of the sender and receiver nodes in
interaction $i$, and $\delta$ is the Kronecker delta. $\bar{z}_v$ therefore represents the empirical
behavior class frequencies for node $v$. The softmax function provides the following
distribution:

$$p(y_v \mid \eta, \bar{z}_v) = \frac{\exp(\eta_{y_v}^T \bar{z}_v)}{\sum_c \exp(\eta_c^T \bar{z}_v)}$$
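
As a hedged illustration of the class-label step, the sketch below computes the empirical role frequencies of a node from a list of role-labelled interactions and passes them through a softmax. The tuple layout of `interactions`, the function names and the weight values are assumptions introduced for this example.

```python
import math


def empirical_role_frequencies(node, interactions, num_roles):
    """Average the role indicators of `node` over its interactions.

    Each interaction is a tuple (sender, receiver, sender_role, receiver_role).
    """
    counts = [0.0] * num_roles
    for sender, receiver, sender_role, receiver_role in interactions:
        if sender == node:
            counts[sender_role] += 1
        if receiver == node:
            counts[receiver_role] += 1
    n_v = sum(counts)  # number of times the node is involved in an interaction
    return [c / n_v for c in counts] if n_v else counts


def softmax_class_probs(eta, z_bar):
    """eta[c] is the weight vector of class c; returns p(y = c | eta, z_bar)."""
    scores = [sum(w * z for w, z in zip(eta_c, z_bar)) for eta_c in eta]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Node 0 sends twice as role 0 and receives once as role 1.
interactions = [(0, 1, 0, 1), (0, 2, 0, 1), (2, 0, 1, 1)]
z_bar = empirical_role_frequencies(0, interactions, num_roles=2)
probs = softmax_class_probs(eta=[[2.0, 0.0], [0.0, 2.0]], z_bar=z_bar)
```

Here `z_bar` is `[2/3, 1/3]`, so the class whose weights favour role 0 receives the larger probability.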

4 Collapsed Variational Inference

Inference of the network roles, z, can be efficiently computed using variational 
inference. Variational inference has an advantage over sampling methods in that
convergence is faster and easier to diagnose. Previous work has shown that
the performance differences between inference methods can be minimal given
appropriate hyperparameter settings [13].

Variational Bayes [14] introduces an approximate variational posterior distribution,
$q$, over the latent variables (roles) and model parameters ($\pi$, $\phi$, $\theta$). Usually
this is a fully factorised distribution known as a mean-field approximation which
provides a more tractable lower bound on the log evidence:

$$\log p(x \mid \theta) \geq \mathcal{F}(q, \theta) = E_q[\log p(z, x \mid \theta)] - E_q[\log q(z)]. \tag{1}$$
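
The lower-bound property can be checked numerically on a toy problem. The sketch below uses a hypothetical table of joint probabilities for a single discrete latent variable (the values are invented for illustration) and shows that an arbitrary variational distribution gives a value below the log evidence, while the exact posterior attains it.

```python
import math


def elbo(q, joint):
    """E_q[log p(z, x)] - E_q[log q(z)] for a discrete latent variable z."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, joint) if qz > 0)


# Hypothetical joint probabilities p(z, x) for one observed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint))          # log p(x)
posterior = [p / sum(joint) for p in joint]  # exact p(z | x)

loose = elbo([1 / 3, 1 / 3, 1 / 3], joint)   # arbitrary q
tight = elbo(posterior, joint)               # q equal to the exact posterior
```

The gap between `loose` and `log_evidence` is the Kullback-Leibler divergence from the chosen distribution to the posterior, which is why maximising the bound drives the approximation towards the true posterior.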

By taking advantage of the conjugacy of the Dirichlet-Categorical and Beta- 
Bernoulli distributions, the model parameters can be integrated out exactly. 
This treatment yields the collapsed variational posterior, parameterised by the 
variational parameter $\lambda$:

$$q(z) = \prod_i q(z_i \mid \lambda_i), \tag{2}$$
which provides a tighter bound on the evidence [3]. However, exact implementation
of collapsed variational Bayes is computationally too expensive and therefore
in practice a first order Taylor expansion is used to approximate the update
equations. Further information on collapsed variational inference along with the
first order approximation implemented here is given in [15,13]. The following
sections detail the update equations for the three models.



4.1 Standard Stochastic Blockmodel 

Inference in the standard Stochastic Blockmodel consists of sequentially updat- 
ing the variational posterior distribution over role assignments for each node 
according to: 

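$$
\begin{aligned}
\lambda_{v,k} \propto{} & (n_k^{\neg v} + \alpha) \\
& \times \frac{\prod_{i=1}^{f_{v,k}+g_{v,k}} \big(d_{k,k}^{\neg v} + \beta_1 + i\big) \prod_{i=1}^{2n_k^{\neg v}+1-f_{v,k}-g_{v,k}} \big((n_k^{\neg v})^2 - d_{k,k}^{\neg v} + \beta_2 + i\big)}{\prod_{i=1}^{2n_k^{\neg v}+1} \big((n_k^{\neg v})^2 + \beta_1 + \beta_2 + i\big)} \\
& \times \prod_{k_2} \left( \frac{\prod_{i=1}^{f_{v,k_2}} \big(d_{k,k_2}^{\neg v} + \beta_1 + i\big) \prod_{i=1}^{n_{k_2}^{\neg v}-f_{v,k_2}} \big(n_k^{\neg v} n_{k_2}^{\neg v} - d_{k,k_2}^{\neg v} + \beta_2 + i\big)}{\prod_{i=1}^{n_{k_2}^{\neg v}} \big(n_k^{\neg v} n_{k_2}^{\neg v} + \beta_1 + \beta_2 + i\big)} \right)^{(1-\delta_{k,k_2})} \\
& \times \prod_{k_1} \left( \frac{\prod_{i=1}^{g_{v,k_1}} \big(d_{k_1,k}^{\neg v} + \beta_1 + i\big) \prod_{i=1}^{n_{k_1}^{\neg v}-g_{v,k_1}} \big(n_{k_1}^{\neg v} n_k^{\neg v} - d_{k_1,k}^{\neg v} + \beta_2 + i\big)}{\prod_{i=1}^{n_{k_1}^{\neg v}} \big(n_{k_1}^{\neg v} n_k^{\neg v} + \beta_1 + \beta_2 + i\big)} \right)^{(1-\delta_{k_1,k})}
\end{aligned}
\tag{3}
$$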

where $d_{k_1,k_2}$ is the count of links from role $k_1$ to role $k_2$, $f_{v,k}$ is the number of
times node $v$ is the sender in an interaction with a node of role $k$ and similarly
$g_{v,k}$ is the number of times $v$ is a receiver in an interaction with a node of role $k$.
The totals for each role are given by $n_k$. Collapsed variational inference involves
removing the counts of the current node, which is denoted by the superscript $\neg v$.
The nodes in the network used for training have their roles initialised to reflect
their class (i.e. role = class) and the inference is carried out on the unlabelled
nodes only.
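
To make the role update concrete, the following sketch computes the exact collapsed conditional for a single unlabelled node by brute force over hard role assignments, using log-Gamma functions. This is a swapped-in technique chosen for clarity, not the first order approximation used in this work: it evaluates the marginal likelihood (with the role distribution and link probabilities integrated out) once per candidate role and normalises. All function names, hyperparameter defaults and the toy network are assumptions.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_collapsed_joint(adj, z, num_roles, alpha, beta1, beta2):
    """log p(links, roles) up to a constant, with theta and pi integrated out."""
    n = len(adj)
    sizes = [0] * num_roles
    for k in z:
        sizes[k] += 1
    links = [[0] * num_roles for _ in range(num_roles)]
    for s in range(n):
        for r in range(n):
            links[z[s]][z[r]] += adj[s][r]
    # Dirichlet-Categorical term (factors that do not depend on z are dropped).
    total = sum(math.lgamma(size + alpha) for size in sizes)
    # Beta-Bernoulli term for each of the K x K role interactions.
    for k1 in range(num_roles):
        for k2 in range(num_roles):
            pairs = sizes[k1] * sizes[k2]
            d = links[k1][k2]
            total += log_beta(d + beta1, pairs - d + beta2) - log_beta(beta1, beta2)
    return total


def role_posterior(adj, z, v, num_roles, alpha=1.0, beta1=1.0, beta2=1.0):
    """Normalised p(z_v = k | all other roles, links) for each role k."""
    logs = []
    for k in range(num_roles):
        candidate = list(z)
        candidate[v] = k
        logs.append(log_collapsed_joint(adj, candidate, num_roles,
                                        alpha, beta1, beta2))
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    return [w / sum(weights) for w in weights]


# Nodes 0-3 are labelled (role = class); node 4 is unlabelled and, like
# nodes 2 and 3, only receives links from nodes 0 and 1.
adj = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
post = role_posterior(adj, [0, 0, 1, 1, 0], v=4, num_roles=2)
```

The unlabelled node is assigned to role 1 with high probability because its pattern of incoming and outgoing links matches that of the other role 1 nodes, even though it has no links to them.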



4.2 Supervised Single Membership Blockmodel 

The update procedure for the SSMB is almost identical to the SBM but incorporates
information about the class-role co-occurrence counts $m_{c,k}$.

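$$
\begin{aligned}
\lambda_{v,k} \propto{} & (n_k^{\neg v} + \alpha) \, \frac{m_{y_v,k}^{\neg v} + \eta}{m_{\cdot,k}^{\neg v} + C\eta} \\
& \times \frac{\prod_{i=1}^{f_{v,k}+g_{v,k}} \big(d_{k,k}^{\neg v} + \beta_1 + i\big) \prod_{i=1}^{2n_k^{\neg v}+1-f_{v,k}-g_{v,k}} \big((n_k^{\neg v})^2 - d_{k,k}^{\neg v} + \beta_2 + i\big)}{\prod_{i=1}^{2n_k^{\neg v}+1} \big((n_k^{\neg v})^2 + \beta_1 + \beta_2 + i\big)} \\
& \times \prod_{k_2} \left( \frac{\prod_{i=1}^{f_{v,k_2}} \big(d_{k,k_2}^{\neg v} + \beta_1 + i\big) \prod_{i=1}^{n_{k_2}^{\neg v}-f_{v,k_2}} \big(n_k^{\neg v} n_{k_2}^{\neg v} - d_{k,k_2}^{\neg v} + \beta_2 + i\big)}{\prod_{i=1}^{n_{k_2}^{\neg v}} \big(n_k^{\neg v} n_{k_2}^{\neg v} + \beta_1 + \beta_2 + i\big)} \right)^{(1-\delta_{k,k_2})} \\
& \times \prod_{k_1} \left( \frac{\prod_{i=1}^{g_{v,k_1}} \big(d_{k_1,k}^{\neg v} + \beta_1 + i\big) \prod_{i=1}^{n_{k_1}^{\neg v}-g_{v,k_1}} \big(n_{k_1}^{\neg v} n_k^{\neg v} - d_{k_1,k}^{\neg v} + \beta_2 + i\big)}{\prod_{i=1}^{n_{k_1}^{\neg v}} \big(n_{k_1}^{\neg v} n_k^{\neg v} + \beta_1 + \beta_2 + i\big)} \right)^{(1-\delta_{k_1,k})}
\end{aligned}
\tag{4}
$$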

In this model the roles are initialised randomly and the inference procedure
sequentially updates each node in the network until convergence. Note that,
unlike in the SBM, inference also occurs over the training nodes.

Prediction of the unknown labels requires calculation of the parameter $\mu$
according to (5):

$$\mu_{c,k} = \frac{m_{c,k} + \eta}{m_{\cdot,k} + C\eta} \tag{5}$$

Prediction of node class labels is done according to (6):

$$y_v^* = \arg\max_{y \in \{1,\ldots,C\}} E_q[\mu_{y,z_v}] = \arg\max_{y \in \{1,\ldots,C\}} \mu_y^T \lambda_v \tag{6}$$
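
The prediction step of (5) and (6) amounts to smoothing the class-role co-occurrence counts and taking the class with the largest expected probability under the node's posterior over roles. A minimal sketch follows; the function names, the count matrix and the posterior vector are hypothetical values chosen for illustration.

```python
def class_given_role(m, eta):
    """mu[c][k] = (m[c][k] + eta) / (sum over classes of m[.][k] + C * eta)."""
    num_classes, num_roles = len(m), len(m[0])
    col_totals = [sum(m[c][k] for c in range(num_classes))
                  for k in range(num_roles)]
    return [[(m[c][k] + eta) / (col_totals[k] + num_classes * eta)
             for k in range(num_roles)] for c in range(num_classes)]


def predict_class(mu, lam_v):
    """Return the class y maximising sum_k mu[y][k] * lam_v[k]."""
    scores = [sum(mu_yk * l for mu_yk, l in zip(row, lam_v)) for row in mu]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical co-occurrence counts: 2 classes (rows) by 3 roles (columns).
m = [[8, 1, 0],
     [0, 1, 6]]
mu = class_given_role(m, eta=0.5)
label = predict_class(mu, lam_v=[0.1, 0.2, 0.7])
```

In this example the node places most of its posterior mass on the third role, which co-occurs almost exclusively with the second class, so `label` is 1 (zero-based).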



4.3 Supervised Mixed Membership Blockmodel 

For the SMMB, role pairs (sender role and receiver role) are assigned to each 
interaction rather than assigning a single role to a node. Inference is therefore 
conducted by sequentially updating the variational posterior for each network 
interaction according to: 



A;,fci,fc 2 oc (dkl,k 2 + a k 1 ,k 2 ) 



x exp 



Vy Si ,ki hi jSij ki Vy ri ,k2 hi,ri,k 2 



h T X old 



h T X oU 



(7) 



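where $f_{v,k}$ ($g_{v,k}$) are the number of times node $v$ is a sender (receiver) as role
$k$, $n_v$ is the number of times node $v$ is involved in an interaction and $\lambda_v$ is
a $K$-length vector representing the marginal probability of sender or receiver
positions, i.e. for the sender of interaction $i$:

$$\lambda_{s_i} = \Big( \sum_{k_2} \lambda_{i,1,k_2}, \; \ldots, \; \sum_{k_2} \lambda_{i,K,k_2} \Big) \tag{8}$$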



and $h_{i,v}^T \lambda_v$ represents an approximation to the expectation under the $q$ distribution
of the normalising function of the softmax distribution for a node $v$. This
$h_{i,v}$ is given by:



Kv = £ exp ( n (E -p (^f) ) 

c x / j^i- \ k x ' / 



(9) 



v ^{ s 3> r j} 



$\eta$ is found using conjugate gradient to optimise the free energy terms of (1)
corresponding to $\eta$:



{si,r;} 



(10) 
where $\bar{\lambda}_v = \frac{1}{n_v} \sum_i (\lambda_{s_i} \delta_{s_i,v} + \lambda_{r_i} \delta_{r_i,v})$. Conjugate gradient requires the following
derivatives:



djr [m:c] 
dj] c ,k 



It,.,., (E i A„ ( exp(^)) 



E 



Ec n^ g{Sl ,r l} (El V* eX P (^f )) 



^-\ v ,k exp 



— y^". (ll) 

Predicting unlabeled nodes requires inference of the network positions given
the rest of the network. As the class label is unknown the inference is performed
as above but without the terms involving $\eta$. Classification of a test node is given
by:

$$y_v^* = \arg\max_{y \in \{1,\ldots,C\}} E_q[\eta_y^T \bar{z}_v] = \arg\max_{y \in \{1,\ldots,C\}} \eta_y^T \bar{\lambda}_v. \tag{12}$$



5 Experiments 
5.1 Data 

Networks generated from four real-world datasets were examined in this work:
a citation network, a feeding web network and two word networks. All of the
networks are directed. Each of the datasets has a different underlying structure
with respect to the given classification task. The first network is the Cora citation
network [16], a popular dataset for collective classification comprising 2708
nodes representing scientific papers and 5429 links representing the citations 
between them. The classification task is to assign each paper one of 7 subject 
categories. 

The second network is a word network made up of the 112 most frequently 
occurring adjectives and nouns in Charles Dickens' novel David Copperfield [17]. 
The words are linked if they appear adjacent to each other in the text. 
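
Constructing such a word network from text can be sketched as below. The function name `adjacency_network` and the example sentence are hypothetical, and the sketch assumes that the direction of a link follows the order in which the two words appear, which the text does not state explicitly.

```python
def adjacency_network(tokens, vocabulary):
    """Directed links between vocabulary words that are adjacent in the text.

    Returns a set of (first_word, second_word) pairs; the link direction
    follows word order, which is an assumption of this sketch.
    """
    vocab = set(vocabulary)
    links = set()
    for first, second in zip(tokens, tokens[1:]):
        if first in vocab and second in vocab:
            links.add((first, second))
    return links


tokens = "the old man saw the little old house".split()
links = adjacency_network(tokens, vocabulary=["old", "man", "little", "house"])
print(sorted(links))  # [('little', 'old'), ('old', 'house'), ('old', 'man')]
```

Only pairs in which both words belong to the chosen vocabulary produce a link, mirroring the restriction to the most frequent adjectives and nouns.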

The third network is also a word network, this time the Brown corpus¹ [18],
which is a tagged corpus of present-day edited American English across various 
categories. In this work a network was created using words from the News cat- 
egory which occurred at least 10 times and were tagged as either verb, adverb, 
pronoun, noun or adjective. This resulted in a network of 990 words with 6157 
links between them. 

The fourth network is a food web of 463 species in the Weddell Sea in the
Antarctic where the 1939 edges point to each predator from its prey² [19]. The
variable used for classification is the feeding type which takes 6 values, namely 
primary producer, omnivorous, herbivorous/detrivorous, carnivorous, detrivo- 
rous, and carnivorous/detrivorous. 

1 Available from http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

2 The Weddell Sea food web is available to download as part of a larger dataset at
http://www.esapubs.org/archive/ecol/E086/135/default.htm

5.2 Classification Performance

Experiments were run to compare the classification performance of the
models as the maximum number of roles, K, was varied. Note that for the
SBM the value of K was required to be fixed as equal to the number of classes 
(C). In all experiments 50% of the nodes were used for training. 

Performance is measured according to the macro-averaged F1 measure given
by:

$$F_1 = \frac{2TP}{2TP + FN + FP},$$

where TP, FN, and FP correspond to the numbers of true positives, false negatives
and false positives respectively. The F1 measure represents the harmonic mean of
the precision and recall values. For the multi-class problems the macro-average
is used, i.e. the F1 score is calculated for each class and then averaged. This
removes the bias in accuracy due to different class sizes in the datasets. Each 
experiment was run 100 times and the performance scores reported reflect the 
average over these runs. 
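
The macro-averaged F1 measure can be computed directly from the definition. The sketch below is an illustrative implementation, not the evaluation code used in the experiments; the function name `macro_f1` and the convention of taking the set of classes from the true labels are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Average of the per-class F1 scores, F1 = 2TP / (2TP + FN + FP)."""
    classes = sorted(set(y_true))  # assumption: classes come from true labels
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        denom = 2 * tp + fn + fp
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
```

Here the per-class scores are 6/7 and 4/5, giving a macro average of 29/35. Each class contributes equally regardless of its size, which is the property that removes the bias due to different class sizes.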

Figure 1 shows the performance of the three models across the different
networks. For the Cora and David networks it can be seen that the SBM model 
performs well and that there is no real advantage to using one of the other 
models. The Cora network is highly assortative and the David network is highly 
disassortative; however, even though the networks are very different in structure,
they both contain homogeneous classes and so each class can be modelled with
a single network role. 

The other two networks, News and Weddell, showed that the supervised 
(single and mixed membership) models offered a significant improvement over 
the SBM. This suggests that there exists some heterogeneity in the interaction 
patterns within classes. For the News dataset, the mixed membership model 
performs a little better than the single membership. On the other hand, in the 
Weddell dataset the single membership model performs a lot better than the 
mixed membership; however, the single membership model has a much higher
variance in performance compared to the more stable but less accurate mixed 
membership model. 

Figure 2 shows the average run times of the supervised blockmodels, which
depend on the computational complexity of the inference updates and the
algorithm convergence times. It can be seen that the mixed membership (SMMB)
model has a significantly longer run time than the single membership model due
to the computational cost of finding the softmax parameters $\eta$ in the conjugate
gradient step in (11).



Fig. 1. Classification performance of the 3 supervised blockmodel approaches as the
parameter K (maximum roles) is varied.

Fig. 2. Average run times in seconds for the two supervised blockmodels as the
parameter K (maximum roles) is varied.

5.3 Summary Networks

This section describes how the learned blockmodel can be used to understand
the structure of the network with respect to the classes. The analysis will be
on the mixed membership model although a similar analysis can be conducted
with the SSMB. Focusing on the News network created from the Brown corpus,
Figure 3 shows the summary network of how the identified network roles interact.
The colour of the lines indicates the probability of observing a link type, where
darker edges represent more likely interactions. Figure 4 shows a visualisation of
the distribution over roles (columns) for each node (rows) in the News network.
By ordering the nodes by class it is possible to get an overall picture of the
relationship between classes and network roles. Figure 5 shows the distribution over
classes for each of the 10 network roles. Using this information together it is
possible to identify patterns in the connectivity of the classes and therefore in
the ordering of the classes of words in the News corpus. For example, it can be
seen that Roles 1 and 2 are usually verbs and that there is a chain of frequently
co-occurring roles 4-2-5-1. Comparing Roles 4 and 5 it can be seen that verbs
(appearing more in 4 than 5) can come before other verbs but are unlikely to be
between two verbs. Pronouns and Nouns can come before and between verbs, and
Adverbs are only associated with Verbs (they do not appear in any other network
role). Other relationships that can be seen are that Nouns occur together (Roles
7 and 9) and that Pronouns and Adjectives precede Nouns (Roles 10 and 8
respectively).







Fig. 3. Summary network of the discovered role interactions in the word network
created from News articles in the Brown Corpus. Darker arrows indicate higher
frequency of interaction.



Fig. 4. Node role assignment matrix for the News dataset. Each row represents a
node and its posterior distribution over roles. Rows are ordered by class to
highlight correspondence between class and position.



Fig. 5. Distributions over classes for each role in the News word network from the
Brown corpus.

6 Discussion

This work has demonstrated how the pattern of interactions alone can be used
to classify unlabelled instances in relational data. For simple cases this can be
achieved with the well studied Stochastic Blockmodel (SBM) by considering the
network roles as classes. In cases where classes exhibit heterogeneity in their 
interactions more complex models are required. A small modification to the 
Stochastic Blockmodel results in the Supervised Single Membership Blockmodel 
(SSMB) which can give significantly better classification performance. The
Supervised Mixed Membership Blockmodel (SMMB) also performs well but does so at
a significantly higher computational cost. Based on the few examples presented
here it seems that the benefit of the mixed membership model is outweighed by
its computational complexity; however, more work is required to confirm this.
Finally, supervised blockmodels not only provide good classification performance
but also an interpretable model to explore the structure of the data and the 
relationship between and within classes. 